# read_csv() and the pipe used below come from the tidyverse
library(tidyverse)
mnist_raw <- read_csv("https://pjreddie.com/media/files/mnist_train.csv", col_names = FALSE)
mnist_raw_test <- read_csv("https://pjreddie.com/media/files/mnist_test.csv", col_names = FALSE)
The MNIST dataset is a compilation of handwritten digits that have been digitized for use in supervised machine learning classification applications. The dataset is fairly large, with 70,000 observations, each of which is a handwritten digit collected from various subjects. As a result of this diversity, images of the same digit differ both between penmanship styles and within a single penmanship style.
Each handwritten digit is stored as a 28x28 pixel image. The digits range from 0 to 9, and each one has been size-normalized and centered on the image canvas.
Below is a sample of the number 3 from one of the observations in the dataset. Note that the images are grayscale.
# Row 8 holds a handwritten 3; drop the label column (X1) and reshape the 784 pixels to 28x28
m <- t(matrix(mnist_raw[8,] %>% select(-X1), ncol = 28))
m2 <- matrix(unlist(m), nrow = 28, byrow = FALSE)
# Blank out the dimnames so the matrix prints without row and column headers
dimnames(m) <- list(rep("", dim(m)[1]), rep("", dim(m)[2]))
# image() draws the first matrix row at the bottom, so rotate the matrix before plotting
rotate <- function(x) t(apply(x, 2, rev))
image(rotate(m2), col = gray((255:0)/255))
Every pixel on the canvas is represented by an integer from 0 to 255, where 0 means the pixel is completely white, 255 means it is completely black, and the values from 1 to 254 are the shades of gray in between. Since each image is 28x28 pixels in size, each image can be represented by a 28x28 matrix. Below is a matrix representation of the number 3 (observation 8).
## Label: 3
##
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 38 43 105 255 253 253 253 253 253 174 6
## 0 0 0 0 0 0 0 0 0 43 139 224 226 252 253 252 252 252 252 252 252 158
## 0 0 0 0 0 0 0 0 0 178 252 252 252 252 253 252 252 252 252 252 252 252
## 0 0 0 0 0 0 0 0 0 109 252 252 230 132 133 132 132 189 252 252 252 252
## 0 0 0 0 0 0 0 0 0 4 29 29 24 0 0 0 0 14 226 252 252 172
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 85 243 252 252 144
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 88 189 252 252 252 14
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 91 212 247 252 252 252 204 9
## 0 0 0 0 0 0 0 0 0 32 125 193 193 193 253 252 252 252 238 102 28 0
## 0 0 0 0 0 0 0 0 45 222 252 252 252 252 253 252 252 252 177 0 0 0
## 0 0 0 0 0 0 0 0 45 223 253 253 253 253 255 253 253 253 253 74 0 0
## 0 0 0 0 0 0 0 0 0 31 123 52 44 44 44 44 143 252 252 74 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 252 252 74 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 86 252 252 74 0 0
## 0 0 0 0 0 0 5 75 9 0 0 0 0 0 0 98 242 252 252 74 0 0
## 0 0 0 0 0 61 183 252 29 0 0 0 0 18 92 239 252 252 243 65 0 0
## 0 0 0 0 0 208 252 252 147 134 134 134 134 203 253 252 252 188 83 0 0 0
## 0 0 0 0 0 208 252 252 252 252 252 252 252 252 253 230 153 8 0 0 0 0
## 0 0 0 0 0 49 157 252 252 252 252 252 217 207 146 45 0 0 0 0 0 0
## 0 0 0 0 0 0 7 103 235 252 172 103 24 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 14 0 0 0 0 0
## 59 0 0 0 0 0
## 59 0 0 0 0 0
## 7 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
## 0 0 0 0 0 0
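The mapping from pixel values to shades used in the plot above can be checked directly. The following is a minimal sketch, assuming the same `gray((255:0)/255)` palette as in the plotting code; the helper `shade_for()` is a hypothetical name introduced here for illustration and is not part of the original analysis.

```r
# Same palette as in the image() call: 256 shades, running from white to black
shades <- gray((255:0) / 255)

# Hypothetical helper: look up the shade for a pixel value in 0..255
# (R vectors are 1-indexed, so pixel value v lives at position v + 1)
shade_for <- function(v) shades[v + 1]

shade_for(0)    # "#FFFFFF", completely white
shade_for(255)  # "#000000", completely black
```

Reversing the sequence (`255:0` rather than `0:255`) is what makes high pixel values print dark, so the digit appears as dark ink on a white canvas.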
To represent each matrix as an observation, each digit matrix has been flattened (converted from a two-dimensional array to a one-dimensional array), so that each observation is a list of \(28 \times 28 = 784\) integers, each of which is a pixel value from 0 to 255. The list has one additional value, a number from 0 to 9 that holds the label of the digit shown in the image, bringing the length of each observation to 785.
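The flattening step can be illustrated on a synthetic image. This is a sketch only: the pixel values are random rather than taken from MNIST, and it assumes the pixels are stored row by row, which is what the transpose in the plotting code above suggests.

```r
# Synthetic 28x28 "image" with pixel intensities in 0..255 (not real MNIST data)
set.seed(1)
img <- matrix(sample(0:255, 28 * 28, replace = TRUE), nrow = 28, ncol = 28)

# Flatten row by row; t() is needed because R stores matrices column by column
pixels <- as.vector(t(img))

# Prepend the label to form one observation
label <- 3L
observation <- c(label, pixels)

# Reverse the process: drop the label and rebuild the 28x28 matrix row by row
restored <- matrix(observation[-1], nrow = 28, ncol = 28, byrow = TRUE)

length(pixels)       # 784 pixel values
length(observation)  # 785 values including the label
identical(restored, img)
```

The round trip back to `img` shows that flattening loses no information as long as the same ordering is used in both directions.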
Bringing this all together, the dataset has 70,000 observations (60,000 training and 10,000 testing), 784 features and 1 classification label. Because the digit images are now represented numerically, we can perform additional analysis and modeling on them.
# Reduce the dataset down from 60,000 observations
mnist_subset <- mnist_raw %>% head(5000)
# Relabel X1 and add instance number
mnist_subset <- mnist_subset %>% rename(label = X1) %>% mutate(instance = row_number())
# Gather columns into x, y values
mnist_subset
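The code above stops before the gather step that its last comment announces, so the following is only a guess at what such a step might look like, not the author's code. It uses a tiny made-up table (`toy`, two 2x2 "images") in place of `mnist_subset`, and `pivot_longer()`, the newer tidyr equivalent of `gather()`. The names `toy`, `side` and `pixels_long` are assumptions introduced for this sketch.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for mnist_subset: 2 instances of a 2x2 image (hypothetical values)
toy <- tibble(
  label = c(3L, 7L),
  X2 = c(0L, 10L), X3 = c(255L, 20L), X4 = c(128L, 30L), X5 = c(64L, 40L),
  instance = 1:2
)
side <- 2L  # image width in pixels; this would be 28L for MNIST

pixels_long <- toy %>%
  # One row per (instance, pixel) instead of one column per pixel
  pivot_longer(cols = starts_with("X"), names_to = "pixel", values_to = "value") %>%
  mutate(
    pixel = as.integer(sub("X", "", pixel)) - 2L,  # X2 is the first pixel, index 0
    x = pixel %% side,   # column position within the image
    y = pixel %/% side   # row position within the image
  )

pixels_long
```

A long table with one row per pixel and explicit `x` and `y` coordinates is convenient for plotting with ggplot2 or for per-pixel summaries across instances.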